Chapter 6 Exercises

These are intended to be done after completing the worked examples.

6.1 Exercise 1 — https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119732

Using GSE119732,

  1. Map the distribution of the samples as a box plot and a density plot using only the raw data.

First, load required libraries, and load the data.

library(GEOquery)
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: generics
## 
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, setequal, union
## The following object is masked from 'package:generics':
## 
##     explain
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tibble)
if(!require(tidyverse)){
  install.packages("tidyverse",dependencies = TRUE)
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ readr     2.1.6
## ✔ ggplot2   4.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine()    masks Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)

gse1 <- "GSE119732"

source("./supp_functions.R")
fetch_geo_supp(gse = gse1)
## Using locally cached version of supplementary file(s) GSE119732 found here:
## data/GSE119732/GSE119732_count_table_RNA_seq.txt.gz
path1 <- file.path("data", gse1)
files1 <- list.files(path1, pattern = "\\.txt.gz$|\\.tsv.gz$|\\.csv.gz$", 
                    full.names = TRUE, recursive = TRUE)

files1
## [1] "data/GSE119732/GSE119732_count_table_RNA_seq.txt.gz"

The file is a .txt file. Using the raw data, we can map the distributions of the samples as a box plot.

gse1_data <- read_table(files1)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   gene_id = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
plot_box(gse1_data[,2:ncol(gse1_data)])

Secondly, we can also map the distributions of the samples as a density plot.

plot_density(gse1_data[,2:ncol(gse1_data)])

  1. Add colours to the plot separating the different variables in the model

The density plot already has colours (?).

  1. Filter out lowly expressed genes and re plot the distributions.

First, we convert the counts to Counts Per Million (CPM). Then, we filter out lowly expressed genes.

library(edgeR)
## Loading required package: limma
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
# convert to CPM
gse1_cpm <- cpm(y = gse1_data[,2:ncol(gse1_data)])

# remove lowly expressed genes
to_remove <- edgeR::filterByExpr(gse1_cpm,min.count = 3)
## No group or design set. Assuming all samples belong to one group.
gse1_cpm_filtered <- gse1_cpm[to_remove,]

Now, we replot the samples’ distribution.

plot_box(gse1_cpm_filtered,main = "CPM filtered out lowly expressed - \nNO design matrix")

plot_density(gse1_cpm_filtered,main = "CPM filtered out lowly expressed - \n - NO design matrix")

6.2 Exercise 2 — https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122380

Using GSE122380, confirm whether the ID column contains Ensembl IDs with version suffixes.

  1. Map the distribution of the samples as a box plot and a density plot using only the raw data.

First, load the data.

library(GEOquery)
library(knitr)
library(dplyr)
library(tibble)
if(!require(tidyverse)){
  install.packages("tidyverse",dependencies = TRUE)
}
library(tidyverse)

gse2 <- "GSE122380"

source("./supp_functions.R")
fetch_geo_supp(gse = gse2)
## Using locally cached version of supplementary file(s) GSE122380 found here:
## data/GSE122380/GSE122380_Supplementary_Data_Table_S1.xlsx
## Using locally cached version of supplementary file(s) GSE122380 found here:
## data/GSE122380/GSE122380_raw_counts.txt.gz
path2 <- file.path("data", gse2)
files2 <- list.files(path2, pattern = "\\.txt.gz$|\\.tsv.gz$|\\.csv.gz$", 
                    full.names = TRUE, recursive = TRUE)

files2
## [1] "data/GSE122380/GSE122380_raw_counts.txt.gz"

First we preview the data:

gse2_data <- read_table(files2)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Gene_id = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
rmarkdown::paged_table(gse2_data)

Let us also look if there are any Gene_ids with version suffixes:

length(grep("\\.", gse2_data$Gene_id))
## [1] 0
length(gse2_data$Gene_id)
## [1] 16319

So there are no Gene_ids with version suffixes. Next, let us plot the distributions.

plot_box(gse2_data[,2:ncol(gse2_data)])

plot_density(gse2_data[,2:ncol(gse2_data)])

There are a lot of samples, and this would definitely benefit from a design matrix.

colnames(gse2_data)[1:20]
##  [1] "Gene_id"  "18489_0"  "18489_10" "18489_11" "18489_12" "18489_13"
##  [7] "18489_14" "18489_15" "18489_1"  "18489_2"  "18489_3"  "18489_4" 
## [13] "18489_5"  "18489_6"  "18489_7"  "18489_8"  "18489_9"  "18499_0" 
## [19] "18499_10" "18499_11"

Next, we form a design matrix based on what would be the best likely grouping.

#design matrix - 
samples <- colnames(gse2_data)[2:ncol(gse2_data)]
patient_no  <- unlist(lapply(samples,FUN = function(x){unlist(strsplit(x,split = "_"))[2]}))       
sample_no <- unlist(lapply(samples,FUN = function(x){unlist(strsplit(x,split = "_"))[1]}))     
sample_data <- data.frame(samples, patient_no, sample_no)


design_gse2 <- model.matrix(~ 0 + sample_no, data = sample_data)
rownames(design_gse2) <- sample_data$samples
colnames(design_gse2) <- paste0("sample_", levels(factor(sample_no)))
head(design_gse2)
##          sample_18489 sample_18499 sample_18505 sample_18508 sample_18511
## 18489_0             1            0            0            0            0
## 18489_10            1            0            0            0            0
## 18489_11            1            0            0            0            0
## 18489_12            1            0            0            0            0
## 18489_13            1            0            0            0            0
## 18489_14            1            0            0            0            0
##          sample_18517 sample_18520 sample_18855 sample_18858 sample_18870
## 18489_0             0            0            0            0            0
## 18489_10            0            0            0            0            0
## 18489_11            0            0            0            0            0
## 18489_12            0            0            0            0            0
## 18489_13            0            0            0            0            0
## 18489_14            0            0            0            0            0
##          sample_18907 sample_18912 sample_19093 sample_19108 sample_19127
## 18489_0             0            0            0            0            0
## 18489_10            0            0            0            0            0
## 18489_11            0            0            0            0            0
## 18489_12            0            0            0            0            0
## 18489_13            0            0            0            0            0
## 18489_14            0            0            0            0            0
##          sample_19159 sample_19190 sample_19193 sample_19209
## 18489_0             0            0            0            0
## 18489_10            0            0            0            0
## 18489_11            0            0            0            0
## 18489_12            0            0            0            0
## 18489_13            0            0            0            0
## 18489_14            0            0            0            0
  1. Add colours to the plot separating the different variables in the model

The plot is already coloured. (?)

  1. Filter out lowly expressed genes and re plot the distributions.

First, we convert to CPM.

library(edgeR)

# convert to CPM
gse2_cpm <- cpm(y = gse2_data[,2:ncol(gse2_data)])

Then, we filter for lowly-expressed genes using the design matrix.

to_remove_withdesign_gse2 <- edgeR::filterByExpr(gse2_cpm,
                                            min.count = 3,
                                            design = design_gse2)
gse2_cpm_filtered_withdesign <- gse2_cpm[to_remove_withdesign_gse2,]

colnames(gse2_cpm_filtered_withdesign) <- paste(sample_data$sample_no,
                                              1:nrow(sample_data),sep = "_")

plot_box(gse2_cpm_filtered_withdesign,
         main = "CPM filtered out lowly expressed - \n - with design matrix")

plot_density(gse2_cpm_filtered_withdesign,main = "CPM filtered out lowly expressed - \n - with design matrix")

6.3 Exercise 3 -

Can you use the worked example to process the above two GEO records? How?

Yes, you can. While some steps may be different, e.g. need to change the code for creating the design matrix and reading in the data itself, the general process is the same. As such, one could use the worked example with some tweaks to process the above GEO records.